Skip to content

Design document for RemoteRepository DB normalization #7169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 4, 2021

Conversation

humitos
Copy link
Member

@humitos humitos commented Jun 9, 2020

Initial document to discuss the path forward to implement

Reference: #2759

Initial document to discuss the path forward to implement
#2759
@humitos humitos requested a review from a team June 9, 2020 14:09
@ericholscher
Copy link
Member

This makes sense to me. I wonder if it's worth considering only storing the data we actually care about, and creating model fields for it, instead of the entire JSON blob? I guess we've hit times where we wanted extra data that we didn't store previously, but we could always re-query the API to get it, if needed. That would also cut down a good bit on the amount of data storage required.

@humitos
Copy link
Member Author

humitos commented Jun 10, 2020

That's a good question. One thing I'd mention is that having the json field as is helped me to understand the previous modelling, how it was used and the differences between the "duplicated" RemoteRepository objects.

I think it's a trade off between adding more work from our side (add these fields, populate them and maintain this code) vs. db storage.

There is a lot of API URLs in the response that most of them are useless for our case. The rest of the data looks useful, though. Maybe keep using a JSON field but removing these URLs before storing it?

I have a slightly preference to keep the whole json as is and rely on Postgres JSON internals, because seems simpler and safer.

@humitos
Copy link
Member Author

humitos commented Jun 10, 2020

Also, something to note is that the json field is independent from the Social Network, so it will definitely have different data.

@ericholscher
Copy link
Member

Yea, I think at least trimming known useless data would be good. I mostly just want to get the data size reduced.

The other issue is, what is the API response format changes? It seems like that could wreck havoc on our side if we aren't parsing the incoming content to get the values we want.

@humitos
Copy link
Member Author

humitos commented Jun 15, 2020

Yea, I think at least trimming known useless data would be good. I mostly just want to get the data size reduced.

I replied too quickly last time. We are already parsing the relevant data and storing it as RemoteRepository fields outside the json field.

So, we may not require the json field at all. Instead, we may need to add some extra field in case we are missing something important.

The other issue is, what is the API response format changes? It seems like that could wreck havoc on our side if we aren't parsing the incoming content to get the values we want.

If the API response change, even if we are parsing the relevant fields, it may break. We do not have have a good way to protect ourselves here. To avoid this, we should pin to an API version on each service (e.g. https://developer.github.com/v3/media/#request-specific-version)

Copy link
Member Author

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things have changed since I wrote this document. Besides, SSO is already implemented and working with the current modeling: no JSON is required at all. We could completely remove this field if strongly required or think more about it.

I added some comments to keep in mind when implementing this design.


* Keep ``RemoteRepository`` in sync with GitHub repositories.
* Delete ``RemoteRepository`` objects deleted from GitHub.
* Listen to GitHub events to detect ``full_name`` changes and update our objects.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add a remote_id= field, this sync is not even needed.

* Save only one ``RemoteRepository`` per GitHub repository.
* Use an intermediate table between ``RemoteRepository`` and ``User`` to store associated remote data for the specific user.
* Make this model usable from our SSO implementation.
* Use Post ``JSONField`` to store associated ``json`` remote data.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be considered a "Nice to have" in case we don't have the time to make all the changes required to depend on PostgreSQL in tests. If we keep the admin= field in the relationship, filtering over the JSON is not needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably wait here until we migrate to Django 3.x that implements this natively and avoid making this migration twice. We don't really need json field right now.

@humitos
Copy link
Member Author

humitos commented Jul 30, 2020

@ericholscher I updated the document with the latest changes in context. This is ready to merge IMO, and we can workaround some implementation details in the PR that implements this. I'm tempt to say that we can remove the json field completely if we are not going to use it anyways --we have all the data we need in DB model fields.

@stale
Copy link

stale bot commented Sep 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Sep 13, 2020
@humitos humitos added Accepted Accepted issue on our roadmap and removed Status: stale Issue will be considered inactive soon labels Sep 14, 2020
@humitos
Copy link
Member Author

humitos commented Sep 14, 2020

This is required. It's ready for review and merge, IMO. The implementation will take some time, tho.

@humitos
Copy link
Member Author

humitos commented Dec 14, 2020

I'm merging this since we are already implementing this and this document reflects almost exactly what we are doing. I just added a "Rollout plan" section with more info about this.

@humitos humitos merged commit cb17c66 into master Jan 4, 2021
@humitos humitos deleted the humitos/design-remote-repository-normalization branch January 4, 2021 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants